{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KLHiTPXNTf2a"
      },
      "source": [
        "##### Copyright 2025 Google LLC."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "oTuT5CsaTigz"
      },
      "outputs": [],
      "source": [
        "# @title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZNM-D0pLXZeR"
      },
      "source": [
        "# Gemini API: Code analysis using LangChain and DeepLake"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PRZo8H09Bs6u"
      },
      "source": [
        "<a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/cookbook/blob/main/examples/langchain/Code_analysis_using_Gemini_LangChain_and_DeepLake.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" height=30/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bbf4f2c17530"
      },
      "source": [
        "<!-- Princing warning Badge -->\n",
        "<table>\n",
        "  <tr>\n",
        "    <!-- Emoji -->\n",
        "    <td bgcolor=\"#f5949e\">\n",
        "      <font size=30>⚠️</font>\n",
        "    </td>\n",
        "    <!-- Text Content Cell -->\n",
        "    <td bgcolor=\"#f5949e\">\n",
        "      <h3><font color=black>This notebook requires paid tier rate limits to run properly.<br>  \n",
        "(cf. <a href=\"https://ai.google.dev/pricing#veo2\"><font color='#217bfe'>pricing</font></a> for more details).</font></h3>\n",
        "    </td>\n",
        "  </tr>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mOGNjAZMwMIk"
      },
      "source": [
        "This notebook shows how to use Gemini API with [Langchain](https://python.langchain.com/v0.2/docs/introduction/) and [DeepLake](https://www.deeplake.ai/) for code analysis. The notebook will teach you:\n",
        "- loading and splitting files\n",
        "- creating a Deeplake database with embedding information\n",
        "- setting up a retrieval QA chain"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jlzRUaWguYiE"
      },
      "source": [
        "### Load dependencies"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6BiMHjZuRkQM"
      },
      "outputs": [],
      "source": [
        "%pip install -q -U langchain-google-genai langchain-deeplake langchain langchain-text-splitters langchain-community"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "FAsv4ybKOiUK"
      },
      "outputs": [],
      "source": [
        "from glob import glob\n",
        "from IPython.display import Markdown, display\n",
        "\n",
        "from langchain.document_loaders import TextLoader\n",
        "from langchain_text_splitters import (\n",
        "    Language,\n",
        "    RecursiveCharacterTextSplitter,\n",
        ")\n",
        "from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings\n",
        "from langchain.chains import RetrievalQA\n",
        "from langchain_deeplake.vectorstores import DeeplakeVectorStore"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FQOGMejVu-6D"
      },
      "source": [
        "### Configure your API key\n",
        "\n",
        "To run the following cell, your API key must be stored in a Colab Secret named `GEMINI_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see [Authentication](../../quickstarts/Authentication.ipynb) for an example.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ysayz8skEfBW"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "from google.colab import userdata\n",
        "GEMINI_API_KEY=userdata.get('GEMINI_API_KEY')\n",
        "\n",
        "os.environ[\"GEMINI_API_KEY\"] = GEMINI_API_KEY"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vUwX1PxWg31O"
      },
      "source": [
        "## Prepare the files"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-ye873pizjeR"
      },
      "source": [
        "First, download a [langchain-google](https://github.com/langchain-ai/langchain-google) repository. It is the repository you will analyze in this example.\n",
        "\n",
        "It contains code integrating Gemini API, VertexAI, and other Google products with langchain."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xa5Om2YJZMs1"
      },
      "outputs": [],
      "source": [
        "!git clone https://github.com/langchain-ai/langchain-google"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "w4M2xVb9zlbp"
      },
      "source": [
        "This example will focus only on the integration of Gemini API with langchain and ignore the rest of the codebase."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "nKXFtWDSz87m"
      },
      "outputs": [],
      "source": [
        "repo_match = \"langchain-google/libs/genai/langchain_google_genai**/*.py\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "w8ecqCn8z8l5"
      },
      "source": [
        "Each file with a matching path will be loaded and split by `RecursiveCharacterTextSplitter`.\n",
        "In this example, it is specified, that the files are written in Python. It helps split the files without having documents that lack context."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "PYU4TJmXZrHF"
      },
      "outputs": [],
      "source": [
        "docs = []\n",
        "for file in glob(repo_match, recursive=True):\n",
        "  loader = TextLoader(file, encoding='utf-8')\n",
        "  splitter = RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON, chunk_size=2000, chunk_overlap=0)\n",
        "  docs.extend(loader.load_and_split(splitter))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_nBTLdVD34Mg"
      },
      "source": [
        "`Language` Enum provides common separators used in most popular programming languages, it lowers the chances of classes or functions being split in the middle."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "id": "Qtp7Vg0835LR"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "['\\nclass ', '\\ndef ', '\\n\\tdef ', '\\n\\n', '\\n', ' ', '']"
            ]
          },
          "execution_count": 5,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# common seperators used for Python files\n",
        "RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "80xuFyotsaZs"
      },
      "source": [
        "## Create the database\n",
        "The data will be loaded into the memory since the database doesn't need to be permanent in this case and is small enough to fit.\n",
        "\n",
        "The type of storage used is specified by prefix in the path, in this case by `mem://`.\n",
        "\n",
        "Check out other types of storage [here](https://docs.activeloop.ai/setup/storage-and-creds/storage-options)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "id": "VKBZTFWUuYLA"
      },
      "outputs": [],
      "source": [
        "# define path to database\n",
        "dataset_path = 'mem://deeplake/langchain_google'"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {
        "id": "ak3qzDgRuY4S"
      },
      "outputs": [],
      "source": [
        "# define the embedding model\n",
        "embeddings = GoogleGenerativeAIEmbeddings(model=\"models/gemini-embedding-001\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "27Cr4TkMualL"
      },
      "source": [
        "Everything needed is ready, and now you can create the database. It should not take longer than a few seconds."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 27,
      "metadata": {
        "id": "1XCQBbW0g7M0"
      },
      "outputs": [],
      "source": [
        "db = DeeplakeVectorStore.from_documents(\n",
        "    dataset_path=dataset_path,\n",
        "    embedding=embeddings,\n",
        "    documents=docs,\n",
        "    overwrite=True\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "R594TesMl6Pl"
      },
      "source": [
        "## Question Answering"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "F4Ih7Hveuzkj"
      },
      "source": [
        "Set-up the document retriever."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {
        "id": "DtEUchpAmBFQ"
      },
      "outputs": [],
      "source": [
        "retriever = db.as_retriever()\n",
        "retriever.search_kwargs['distance_metric'] = 'cos'\n",
        "retriever.search_kwargs['k'] = 20 # number of documents to return"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 29,
      "metadata": {
        "id": "rvF5W6OvvAXB"
      },
      "outputs": [],
      "source": [
        "# define the chat model\n",
        "llm = ChatGoogleGenerativeAI(model = \"gemini-2.5-flash\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fqM2_zAvwpgJ"
      },
      "source": [
        "Now, you can create a chain for Question Answering. In this case, `RetrievalQA` chain will be used.\n",
        "\n",
        "If you want to use the chat option instead, use `ConversationalRetrievalChain`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 30,
      "metadata": {
        "id": "G__78Wv1nsNd"
      },
      "outputs": [],
      "source": [
        "qa = RetrievalQA.from_llm(llm, retriever=retriever)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3SRcc8LGx6ki"
      },
      "source": [
        "The chain is ready to answer your questions.\n",
        "\n",
        "NOTE: `Markdown` is used for improved formatting of the output."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 31,
      "metadata": {
        "id": "VE30oIsHFZJs"
      },
      "outputs": [],
      "source": [
        "# a helper function for calling retrival chain\n",
        "def call_qa_chain(prompt):\n",
        "  response = qa.invoke(prompt)\n",
        "  display(Markdown(response[\"result\"]))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 32,
      "metadata": {
        "id": "g4ttIFvmn392"
      },
      "outputs": [
        {
          "data": {
            "text/markdown": [
              "```\n",
              "_BaseGoogleGenerativeAI\n",
              "    └── BaseModel\n",
              "```"
            ],
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "call_qa_chain(\"Show hierarchy for _BaseGoogleGenerativeAI. Do not show content of classes.\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 33,
      "metadata": {
        "id": "VL8FXqQKgRB9"
      },
      "outputs": [
        {
          "data": {
            "text/markdown": [
              "The embedding models, specifically `GoogleGenerativeAIEmbeddings`, have two main methods for generating embeddings:\n",
              "\n",
              "1.  **`embed_query`**: Returns a single embedding for a given text as a `List[float]`.\n",
              "2.  **`embed_documents`**: Returns a list of embeddings (one for each text in the input list) as a `List[List[float]]`."
            ],
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "call_qa_chain(\"What is the return type of embedding models.\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 34,
      "metadata": {
        "id": "3VEO6-TwE0MN"
      },
      "outputs": [
        {
          "data": {
            "text/markdown": [
              "The classes related to Attributed Question and Answering (AQA) are:\n",
              "\n",
              "1.  **`GenAIAqa`**: This is the main class representing Google's Attributed Question and Answering service.\n",
              "2.  **`AqaInput`**: Defines the input structure for the `GenAIAqa.invoke` method, including the prompt and source passages.\n",
              "3.  **`AqaOutput`**: Defines the output structure from the `GenAIAqa.invoke` method, containing the answer, attributed passages, and answerable probability.\n",
              "4.  **`_AqaModel`**: An internal wrapper class used by `GenAIAqa` to interact with the underlying Google AQA model.\n",
              "5.  **`GroundedAnswer`**: A dataclass used internally by the AQA implementation to represent a grounded answer, including the answer text, attributed passages, and answerable probability.\n",
              "6.  **`GoogleVectorStore`**: This class has an `as_aqa()` method which allows it to be used to construct a Google Generative AI AQA engine, providing passages from the vector store for grounding.\n",
              "7.  **`_SemanticRetriever`**: This is an internal component of `GoogleVectorStore` that handles the retrieval of passages, which are then used by the AQA service when integrated through `GoogleVectorStore.as_aqa()`."
            ],
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "call_qa_chain(\"What classes are related to Attributed Question and Answering.\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 35,
      "metadata": {
        "id": "mNtpXjYKx9Ye"
      },
      "outputs": [
        {
          "data": {
            "text/markdown": [
              "The `GenAIAqa` class has the following dependencies:\n",
              "\n",
              "1.  **`RunnableSerializable`**: It inherits from `langchain_core.runnables.RunnableSerializable`.\n",
              "2.  **`AqaInput`**: It uses `AqaInput` as its input type for the `invoke` method.\n",
              "3.  **`AqaOutput`**: It returns `AqaOutput` from its `invoke` method.\n",
              "4.  **`_AqaModel`**: It internally uses an instance of `_AqaModel` to interact with the Google GenAI service.\n",
              "5.  **`google.ai.generativelanguage` (as `genai`)**: The `_AqaModel` class, which `GenAIAqa` depends on, directly uses components from `google.ai.generativelanguage` (e.g., `GenerativeServiceClient`, `AnswerStyle`, `SafetySetting`).\n",
              "6.  **`_genai_extension` (as `genaix`)**: The `_AqaModel` class uses functions from this internal module (e.g., `build_generative_service`, `generate_answer`, `GroundedAnswer`)."
            ],
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "call_qa_chain(\"What are the dependencies of the GenAIAqa class?\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wBtm3YM-7vxD"
      },
      "source": [
        "## Summary\n",
        "\n",
        "Gemini API works great with Langchain. The integration is seamless and provides an easy interface for:\n",
        "- loading and splitting files\n",
        "- creating DeepLake database with embeddings\n",
        "- answering questions based on context from files"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iwxEpyvx1jbU"
      },
      "source": [
        "## What's next?\n",
        "\n",
        "This notebook showed only one possible use case for langchain with Gemini API. You can find many more [here](../../examples/langchain)."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "Code_analysis_using_Gemini_LangChain_and_DeepLake.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
